On a Reliable Peer-Review Process 



Arthur Carvalho Kate Larson 

Cheriton School of Computer Science Cheriton School of Computer Science 
University of Waterloo University of Waterloo 

a3carval@cs.uwaterloo.ca klarson@cs.uwaterloo.ca 

March 3, 2013

Abstract

We propose an enhanced peer-review process where the reviewers are encouraged to truthfully
disclose their reviews. We start by modelling that process using a Bayesian model where the
uncertainty regarding the quality of the manuscript is taken into account. After that, we
introduce a scoring function to evaluate the reported reviews. Under mild assumptions, we show
that reviewers strictly maximize their expected scores by telling the truth. We also show how
those scores can be used in order to reach consensus.

1 Introduction

Peer review is a process in which a work is scrutinized by a number of experts with relevant
expertise in order to ensure quality control and provide credibility. It is commonly used when
there is no objective, universal way to measure the work's quality, e.g., when quality is a
subjective matter.

Peer review is one of the pillars of current scientific communication. The canonical process is
well described by Meadows [17]: when a manuscript first arrives at the editorial office of a
journal, it is examined by the editor, who may reject it out of hand either because it is out of
scope or because it is of such low quality that it cannot be considered at all. Manuscripts that
pass this first stage are then sent to experts with relevant expertise who are generally asked to
classify each manuscript as publishable immediately, publishable with minor revisions, or not
publishable. Typically, the manuscript's author does not know the reviewers' identities, but the
reviewers know the identity of the author.

In this paper, we focus on the peer-review process as used in scientific communication since this
is, arguably, its most widely used application. Put differently, peer review can be seen as a
decision-making process where the reviewers serve as "cognitive inputs" in order to help a
decision-maker (chair, editor, etc.) to judge the quality of a work. A crucial point in this
process is that it greatly depends on reviewers' honesty. In the canonical process, they have no
incentives to tell the truth. Several resulting problems have been discussed in the literature,
e.g., bias against female authors, authors from minor institutions, and non-native English
writers [28, 30, 17, 24, 19].

Our major assumption is that bias is a deliberate act, instead of an intrinsic human behaviour.
This allows us to view such a problem as a lack of "truth-telling incentives". Our main
contribution is an enhanced peer-review process that encourages honesty. In detail, we model that
process using a Bayesian model where the uncertainty regarding the quality of the manuscript is
taken into account. We also propose a scoring method for promoting truthfulness. It works by
comparing reported reviews and rewarding agreement. Under the assumptions that reviewers are
Bayesian decision-makers, that they cannot influence other reviewers' reviews, and that they do
not have meaningful information about the quality of the manuscript before reviewing it, we show
that reviewers strictly maximize their expected scores by truthfully disclosing their reviews.

In practice, such scores can be used to distinguish between good and bad reviewers. We also show 
how they can be used in order to combine reviews and reach consensus. 

1.1 Related Work 

Even though the use of peer review in scientific communication can be traced back almost 300 years, it 
was not until the early 1990s that research on that matter became more formalized [28]. A key figure in
this process was Stephen Lock, editor of the BMJ (formerly the British Medical Journal) between 1975 
and 1991. He opened up the debate on peer review as to whether the canonical process is actually the 
fairest and most effective method of making editorial decisions [16]. 

After that, several books about scholarly communication and, consequently, the peer-review
process have been published [17, 23, 22, 11, 12, 30]. Among the most common topics are the changes in the
process over the twentieth century, its underlying costs, strengths and weaknesses. 

Scientists in the biomedical field have been in the forefront of research on the peer-review process, 
which is not surprising considering that in their field dependable quality-controlled information can be 
literally a matter of life and death. In particular, the staff of BMJ has been studying the merits and 
limitations of peer review over a number of years. Most of their work has focused on defining and 
evaluating review quality [29] and examining the effect of particular interventions on the quality of
reviews [28]. However, little to nothing has been done in order to detect or avoid bias.

Currently, the best-known mechanism used to prevent bias is double-blind review, in which both
the author's and the reviewers' identities are hidden. Indeed, it has been reported elsewhere that
such a practice reduces bias against female authors [3]. However, its opponents believe that
knowing the author's identity makes it easier to compare the new manuscript with previously
published papers, and that it also encourages the reviewers to disclose conflicts of interest [1].
Moreover, Newcombe and Bouton [19] note that reviewers who are unaware of the seniority of the
authors very often provide less educational comments for the inexperienced ones.

Another argument that undermines the benefits of double-blind reviewing is that the authorship of the 
manuscript is often obvious to a knowledgeable reader from context (self-referencing, subject, writing 
style, etc.) [33, 14, 10]. Furthermore, that mechanism does not prevent certain types of bias, e.g.,
when a reviewer rejects new evidence or new knowledge because it contradicts established norms, beliefs
or paradigms (also known as the Semmelweis reflex [8]).

More closely related to our approach is the work done by Roos et al. [25]. They propose a maximum
likelihood method for calibrating reviews by estimating both the bias of each reviewer and the unknown, 
ideal score of the manuscript. Their method does not attempt to prevent bias by rewarding truthfulness. 
Actually, it only adjusts reviews a posteriori so that they can be globally comparable. Bias is treated as 
the general rigour of a reviewer across all his reviews, which significantly differs from our definition. 

Much of our work is built on the literature concerning expert opinions. This field is generally con- 
cerned about how expert opinion is used, how an expert's uncertainty is or should be represented, how 
experts do or should reason with uncertainty, how to score the quality and usefulness of expert opinion, 
and how the views of several experts might be combined into a single, consensual assessment [5]. We
study all these points in the context of the peer-review process. 

While the scoring aspect is traditionally used to measure the quality of assessments on the basis
of a future observation, our main goal when scoring reviews is to promote truthfulness. When
objective verification is not possible, as in our setting, then economic measures (not necessarily
money) should exist in order to encourage rational reviewers to truthfully disclose their reviews.
We achieve that through a scoring rule that simply compares reported reviews and rewards
agreement. Thus, it bears a tenuous relation to peer-prediction mechanisms [18] and, more
generally, mechanism design with correlated private information (e.g., [13]).

Over the years, several methods have been proposed to establish expert consensus (e.g., [6, 9, 20]).
One of the most famous is the Delphi technique [15]. It is essentially an empirical procedure applied
iteratively in a sequence of stages. The initial contributions from the experts are collected in the form 
of answers to questionnaires and their comments to these answers. After each stage, the experts are 
informed of the opinions of the others in the group and they are allowed to reassess their own opinions. 
Because of the empirical nature of the Delphi technique, it provides no conditions under which the 
experts can be expected to reach agreement or for terminating the iterative process. Under our method, 
consensus is always reached, and this fact does not depend on the reported reviews. Furthermore, it 
preserves privacy, i.e., reviewers' identities and reviews do not need to be disclosed to other members of 
the group. 

2 Model 

A manuscript is to be reviewed by a set of reviewers $N = \{1, \ldots, n\}$, for $n \geq 2$. The
quality of the manuscript is represented by a collection of $c$ independent discrete random
variables $X_1, \ldots, X_c$ taking on values between $0$ and $v$, for $c \geq 1$ and $v > 0$.
Each random variable $X_j$ follows a categorical distribution parameterized by an unknown
probability vector $\Omega_j = (\omega_{j,0}, \ldots, \omega_{j,v})$, where $\omega_{j,k}$ is the
probability assigned to outcome $k$.

Those random variables represent numerical scores of the manuscript when it is evaluated under 
different criteria (e.g., originality, technical soundness, relevance, etc.). Thus, the parameter c represents
the total number of criteria and v represents the best evaluation score that the manuscript can receive 
under each criterion. 

In order to ensure autonomy, reviewers cannot influence other reviewers' reviews, i.e., they do not 
know each other's identity a priori and they are not allowed to communicate with each other during the
reviewing process. Each review can be seen as if the underlying reviewer is privately observing one 
realization of each random variable, hence resulting in a total of c observations. Thus, our model captures 
the uncertainty of the reviewers regarding the quality of the manuscript. 

The truthful review made by each reviewer $i \in N$ is denoted by $t_i = (t_{i,1}, \ldots, t_{i,c})$,
where $t_{i,j}$ is his observation from $X_j$ (or his evaluation of the manuscript under criterion
$j$). We say that reviewer $i$ is telling the truth when his reported review,
$r_i = (r_{i,1}, \ldots, r_{i,c})$, is equal to his truthful review, i.e., $r_{i,j} = t_{i,j}$ for
every $j \in \{1, \ldots, c\}$. We make two major assumptions regarding reviewers' beliefs:

• Non-informative Dirichlet priors. For each random variable $X_j$, there exists a common prior
distribution over $\Omega_j$, i.e., $p(\Omega_j)$. We assume that this prior is a non-informative
Dirichlet distribution.

• Rationality. After observing the realizations of the random variables, every reviewer $i \in N$
updates his beliefs by applying Bayes' rule to the common priors, i.e., $p(\Omega_j \mid t_{i,j})$,
for every $j \in \{1, \ldots, c\}$.

The first assumption means that reviewers do not have informative prior knowledge about the quality 
of the manuscript. We discuss the meaning and validity of such assumption in the following subsection. 
The second assumption implies that the posterior distributions are consistent with Bayesian updating. 
Since the reviewers are going to take actions (report reviews) based on those posterior distributions, we 
say that they are Bayesian decision-makers. 






2.1 Dirichlet Distribution 



For our purposes, the Dirichlet distribution can be seen as a continuous distribution over
parameter vectors of a categorical distribution. Since $\Omega_j$ is the unknown parameter of the
categorical distribution that represents the quality of the manuscript under the $j$th criterion,
it is natural to consider a Dirichlet distribution as a prior for $\Omega_j$. Its probability
density function is given by:

$$p(\Omega_j \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=0}^{v} \omega_{j,k}^{\alpha_k - 1},$$

where $\alpha = (\alpha_0, \ldots, \alpha_v)$ is a vector of positive reals that determines the
shape of the distribution,

$$B(\alpha) = \frac{\prod_{k=0}^{v} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=0}^{v} \alpha_k\right)},$$

and $\Gamma(x) = (x - 1)!$ for positive integers $x$. For the above distribution, the expected
value of outcome $k$ is $E[\omega_{j,k}] = \alpha_k / \sum_{x=0}^{v} \alpha_x$. The probability
vector $(E[\omega_{j,0}], \ldots, E[\omega_{j,v}])$ is called the expected distribution regarding
$\Omega_j$. An interesting property of the Dirichlet distribution is that it is the conjugate prior
of the categorical distribution [5], i.e., the posterior distribution
$p(\Omega_j \mid \alpha, t_{i,j})$ is itself a Dirichlet distribution. In detail, let $H(x, y)$ be
an indicator function, i.e.:

$$H(x, y) = \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

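After applying Bayes' rule to his prior $p(\Omega_j \mid \alpha)$, reviewer $i$'s posterior
distribution is then $p(\Omega_j \mid \alpha, t_{i,j}) = p(\Omega_j \mid (\alpha_0 + H(t_{i,j}, 0),
\alpha_1 + H(t_{i,j}, 1), \ldots, \alpha_v + H(t_{i,j}, v)))$, i.e., belief updating can be
expressed as an updating of the parameters of the prior distribution. Thereafter, reviewer $i$'s
updated expected distribution regarding $\Omega_j$ is:

$$\left( \frac{\alpha_0 + H(t_{i,j}, 0)}{\sum_{x=0}^{v} \alpha_x + 1},
\frac{\alpha_1 + H(t_{i,j}, 1)}{\sum_{x=0}^{v} \alpha_x + 1}, \ldots,
\frac{\alpha_v + H(t_{i,j}, v)}{\sum_{x=0}^{v} \alpha_x + 1} \right).$$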

We call the above probability vector reviewer $i$'s posterior predictive distribution regarding
$\Omega_j$ because it provides the distribution of future outcomes given the observed data. We
denote it by $\Theta_{i,j} = (\theta_{i,j,0}, \ldots, \theta_{i,j,v})$. In this way, we can regard
the parameters $\alpha_0, \ldots, \alpha_v$ as "pseudo-counts" from "pseudo-data", where each
$\alpha_k$ can be interpreted as the number of times that the $\omega_{j,k}$-probability event has
been observed before. We say that the Dirichlet distribution is non-informative (or uniform) when
all of the elements making up the vector $\alpha$ have the same value. For simplicity's sake, we
assume that this value is equal to 1. Thus, after observing the realization $t_{i,j}$, each element
$k$ of the probability vector $\Theta_{i,j}$ is defined as follows:

$$\theta_{i,j,k} = \frac{\alpha_k + H(t_{i,j}, k)}{\sum_{x=0}^{v} \alpha_x + 1} =
\begin{cases} \frac{2}{v+2} & \text{if } t_{i,j} = k, \\ \frac{1}{v+2} & \text{otherwise.} \end{cases} \qquad (2)$$

Non-informative priors are used when there is no prior knowledge favouring one probability event 
over another. In our scenario, a non-informative prior means that reviewers have no meaningful infor- 
mation about the quality of the manuscript before reviewing it. Mathematically, that means that the 
expected distribution $E[\Omega_j \mid \alpha]$, for every criterion $j \in \{1, \ldots, c\}$, is uniform over the set $\{0, \ldots, v\}$.
Consequently, that assumption implies that each reviewer's relevant information consists exclusively of 
his perceptions of the quality of the manuscript, which is how the peer-review process works (or should 
work) in practice. 
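The conjugate update described in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `posterior_predictive` and its parameters are hypothetical (they do not come from the paper), and the code assumes integer pseudo-counts so that exact fractions can be used.

```python
from fractions import Fraction


def posterior_predictive(observation, v, alpha=None):
    """Posterior predictive distribution over outcomes 0..v after one observation.

    `alpha` holds integer Dirichlet pseudo-counts; None selects the
    non-informative prior in which every pseudo-count equals 1.
    """
    if alpha is None:
        alpha = [1] * (v + 1)
    if len(alpha) != v + 1 or not 0 <= observation <= v:
        raise ValueError("alpha needs v + 1 entries and observation must lie in 0..v")
    # Conjugate update: add one pseudo-count to the observed outcome only.
    updated = [a + (1 if k == observation else 0) for k, a in enumerate(alpha)]
    total = sum(updated)  # equals sum(alpha) + 1
    return [Fraction(u, total) for u in updated]


# One observation of score 3 on a 0..4 scale under the non-informative prior.
theta = posterior_predictive(observation=3, v=4)
print([str(p) for p in theta])
```

Under the non-informative prior, the observed score receives probability 2/(v + 2) and every other score receives 1/(v + 2); `Fraction` prints 2/6 in lowest terms as 1/3.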






2.2 Numerical Example 



In this subsection, we illustrate the concepts defined so far. The same example will be extended in 
subsequent sections in order to illustrate new concepts and results. Consider three reviewers
($n = 3$) and three criteria ($c = 3$), where the top possible evaluation score in each of them is
equal to four ($v = 4$). Suppose that reviewers 1, 2, and 3 observe, respectively, the following
realizations: $t_1 = (0, 1, 3)$, $t_2 = (0, 2, 3)$, and $t_3 = (1, 1, 3)$. The resulting posterior
predictive distributions (Equation 2) are shown in Table 1.

Table 1: Numerical example: reviewers' posterior predictive distributions.

                  j = 1                        j = 2                        j = 3
$\Theta_{1,j}$    (2/6, 1/6, 1/6, 1/6, 1/6)    (1/6, 2/6, 1/6, 1/6, 1/6)    (1/6, 1/6, 1/6, 2/6, 1/6)
$\Theta_{2,j}$    (2/6, 1/6, 1/6, 1/6, 1/6)    (1/6, 1/6, 2/6, 1/6, 1/6)    (1/6, 1/6, 1/6, 2/6, 1/6)
$\Theta_{3,j}$    (1/6, 2/6, 1/6, 1/6, 1/6)    (1/6, 2/6, 1/6, 1/6, 1/6)    (1/6, 1/6, 1/6, 2/6, 1/6)



3 Scoring Reviews 

Reviewers must report their reviews (realizations of the random variables) to a decision-maker. As stated 
before, they may misreport them due to personal interests, biases, etc. A consequence of our assumptions 
regarding reviewers' beliefs is that misreporting becomes a deliberate act, instead of an intrinsic human 
behaviour. If it is a deliberate act, then it can be changed through proper incentives whenever the un- 
derlying reviewers are rational (in an economic sense). Based on this observation, we propose a scoring 
method to evaluate reviews that encourages truthfulness. 

In short, under the assumptions made about autonomy and reviewers' beliefs, the resulting expected 
truth-telling scores¹ are strictly maximized when reviewers truthfully disclose their reviews. In practice,
such scores allow, for example, the decision-maker to blacklist (respectively, reward) reviewers with 
several low (respectively, high) scores. 

In the following subsections, we go into detail about our scoring method. It is based on strictly proper 
scoring rules, which are discussed next. It is interesting to mention that the scoring concepts introduced 
in this section will evolve into weights for combining reviews in Section 4. 

3.1 Scoring Rules 

Consider an uncertain quantity with possible outcomes $o_0, \ldots, o_z$, and a probability vector
$\mathbf{q} = (q_0, \ldots, q_z)$. A scoring rule $R(\mathbf{q}, e)$ is a function that provides a
score for the assessment $\mathbf{q}$ upon observing the event $o_e$.

A scoring rule is called strictly proper whenever an expert receives his maximal expected score if,
and only if, his stated assessment $\mathbf{q}$ corresponds to his true assessment
$\mathbf{q}^* = (q_0^*, \ldots, q_z^*)$ [26]. The expected score of $\mathbf{q}$ at $\mathbf{q}^*$
for a real-valued scoring rule $R(\mathbf{q}, e)$ is:

$$E_{\mathbf{q}^*}[R(\mathbf{q}, e)] = \sum_{e=0}^{z} q_e^* R(\mathbf{q}, e). \qquad (3)$$

¹ In order to prevent future misunderstandings, it is important to note the difference between
evaluation score and truth-telling score: while the first one is assigned to the manuscript as the
result of a reviewer's review, the latter is given to the reviewers in order to promote truthfulness.






Arguably, the best known strictly proper scoring rules, together with their scoring ranges, are [32]:

$$\begin{aligned}
\text{logarithmic:} \quad & R(\mathbf{q}, e) = \log q_e, & (-\infty, 0] \\
\text{quadratic:} \quad & R(\mathbf{q}, e) = 2 q_e - \sum_{x=0}^{z} q_x^2, & [-1, 1] \\
\text{spherical:} \quad & R(\mathbf{q}, e) = \frac{q_e}{\sqrt{\sum_{x=0}^{z} q_x^2}}, & [0, 1]
\end{aligned} \qquad (4)$$

An important property of strictly proper scoring rules is that they are still strictly proper under
positive affine transformations, as shown below.

Lemma 1. If $R(\mathbf{q}, e)$ is a strictly proper scoring rule, then a positive affine
transformation of $R$, i.e., $xR(\mathbf{q}, e) + y$, for $x > 0$ and $y \in \mathbb{R}$, is also
strictly proper.

Proof. We note that $\arg\max_{\mathbf{q}} \left( x E_{\mathbf{q}^*}[R(\mathbf{q}, e)] + y \right) =
\arg\max_{\mathbf{q}} E_{\mathbf{q}^*}[R(\mathbf{q}, e)]$, which is equal to the true assessment
$\mathbf{q}^*$ since $R$ is a strictly proper scoring rule. □
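As an informal numerical illustration of strict properness and of Lemma 1, the Python sketch below evaluates the expected score of a few stated assessments against one true assessment. The probability vectors, the helper `expected_score`, and the constants `x` and `y` tried here are hypothetical examples, not values from the paper, and a finite comparison of this kind illustrates the property without proving it.

```python
import math


def logarithmic(q, e):
    return math.log(q[e])


def quadratic(q, e):
    return 2 * q[e] - sum(p * p for p in q)


def spherical(q, e):
    return q[e] / math.sqrt(sum(p * p for p in q))


def expected_score(rule, true_q, stated_q, x=1.0, y=0.0):
    """Expected value of x * rule(stated_q, e) + y when e is drawn from true_q."""
    return sum(p * (x * rule(stated_q, e) + y) for e, p in enumerate(true_q))


# Hypothetical true assessment and a few alternative stated assessments.
true_q = [0.5, 0.3, 0.2]
candidates = {
    "truthful": [0.5, 0.3, 0.2],
    "overconfident": [0.8, 0.1, 0.1],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "shifted": [0.2, 0.3, 0.5],
}

for rule in (logarithmic, quadratic, spherical):
    # Second pair applies a positive affine transformation (x > 0).
    for x, y in ((1.0, 0.0), (2.5, -1.0)):
        scores = {name: expected_score(rule, true_q, q, x, y)
                  for name, q in candidates.items()}
        best = max(scores, key=scores.get)
        print(rule.__name__, (x, y), "best stated assessment:", best)
```

For each rule and each choice of `x > 0`, the truthful assessment obtains the highest expected score among the candidates, which is what the lemma leads one to expect.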
3.2 Truth-Telling Scores 

Scoring rules require an outcome, or a "reality", to score an assessment. If we knew a priori reviewers' 
truthful reviews, we could compare them to the reported ones and reward agreement. However, due to 
the subjective nature of the peer-review process, we are facing a situation where this objective truth is 
practically unknowable. 

Our final solution to that issue is to compare reviewers' reviews to their peers' reviews and reward 
agreement. Since reviewers observe realizations from similar random variables, the rationale behind our 
approach is to exploit such similarities in order to induce truthfulness. 

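The first step towards computing truth-telling scores is to estimate $\Theta_{i,j}$ (see
Equation 2), for every reviewer $i \in N$ and criterion $j \in \{1, \ldots, c\}$, based on the
reported review $r_{i,j}$. Let $\Phi_{i,j} = (\phi_{i,j,0}, \ldots, \phi_{i,j,v})$ be such
estimation, where:

$$\phi_{i,j,k} = \begin{cases} \frac{2}{v+2} & \text{if } r_{i,j} = k, \\ \frac{1}{v+2} & \text{otherwise.} \end{cases} \qquad (5)$$

Thereafter, the truth-telling score of each reviewer $i \in N$ is computed as follows:

$$s_i = \sum_{j \in \{1, \ldots, c\}} \sum_{z \neq i} \left[ x R(\Phi_{i,j}, r_{z,j}) + y \right], \qquad (6)$$

where $x$ and $y$ are constants, for $x > 0$ and $y \in \mathbb{R}$. In words, our method rewards each estimated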
posterior predictive distribution by using a strictly proper scoring rule R and the reviews of the other 
reviewers as outcomes. In the end, we aggregate the resulting values by summing them up. Intuitively, 
we are considering reported reviews as the outcomes of uncertain events, namely the evaluation scores 
deserved by the manuscript, and rewarding estimated posterior predictive distributions as assessments of 
those values. 

In what follows, we show that under the assumptions previously made about autonomy and reviewers' 
beliefs, each reviewer strictly maximizes his expected truth-telling score by truthfully disclosing his 
review. 






Proposition 1. Each reviewer $i \in N$ strictly maximizes his expected truth-telling score when $r_i = t_i$.

Proof. Since the random variables representing criteria are independent and reviewers cannot affect
their peers' reviews, we can restrict ourselves to show that reviewer $i$ strictly maximizes
$E_{\Theta_{i,j}}[xR(\Phi_{i,j}, r_{z,j}) + y]$, for $z \neq i$, when $r_{i,j} = t_{i,j}$. Since
$R$ is a strictly proper scoring rule, from Lemma 1 we have that:

$$\arg\max_{\Phi_{i,j}} E_{\Theta_{i,j}}[xR(\Phi_{i,j}, r_{z,j}) + y] = \Theta_{i,j}.$$

By construction, $\Phi_{i,j} = \Theta_{i,j}$ if, and only if, $r_{i,j} = t_{i,j}$ (see Equations 2
and 5), thus completing the proof. □

Another way to interpret the above result is to imagine that each reviewer is betting on the review
deserved by the manuscript. Since the only relevant information available to him is the realizations
of the random variables (i.e., his own review), his best strategy (in an expected sense) is to bet
on those realizations, i.e., to bet on his truthful review.
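The betting interpretation can be checked numerically for a single criterion. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the peer reports truthfully, so that from reviewer i's point of view the peer's score is drawn from reviewer i's own posterior predictive distribution. It uses the spherical rule with x = 1 and y = 0.

```python
import math


def predictive(score, v):
    """Distribution with probability 2/(v+2) on `score` and 1/(v+2) elsewhere."""
    return [(2 if k == score else 1) / (v + 2) for k in range(v + 1)]


def spherical(q, e):
    return q[e] / math.sqrt(sum(p * p for p in q))


def expected_payoff(truth, report, v, x=1.0, y=0.0):
    """Expected x * R(phi, peer_score) + y, with the peer's score drawn from theta."""
    theta = predictive(truth, v)   # beliefs after observing `truth`
    phi = predictive(report, v)    # distribution implied by the reported score
    return sum(p * (x * spherical(phi, peer) + y) for peer, p in enumerate(theta))


v = 4
truth = 1
payoffs = {report: expected_payoff(truth, report, v) for report in range(v + 1)}
best_report = max(payoffs, key=payoffs.get)
print("observed:", truth, "best report:", best_report)
print({report: round(p, 4) for report, p in payoffs.items()})
```

In this sketch the truthful report yields a strictly higher expected payoff than any other report, in line with Proposition 1.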

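For our purposes, $R$ can be any bounded strictly proper scoring rule. Henceforth, we assume that
$R$ is the spherical scoring rule (Equation 4). By construction of the estimated posterior
predictive distributions (Equation 5), it is easy to see that the term
$xR(\Phi_{i,j}, r_{z,j}) + y$ can take only two values:

$$x R(\Phi_{i,j}, r_{z,j}) + y = \begin{cases}
x \dfrac{\frac{2}{v+2}}{\sqrt{\left(\frac{2}{v+2}\right)^2 + v \left(\frac{1}{v+2}\right)^2}} + y = \dfrac{2x}{\sqrt{4+v}} + y & \text{if } r_{i,j} = r_{z,j}, \\[2ex]
x \dfrac{\frac{1}{v+2}}{\sqrt{\left(\frac{2}{v+2}\right)^2 + v \left(\frac{1}{v+2}\right)^2}} + y = \dfrac{x}{\sqrt{4+v}} + y & \text{otherwise.}
\end{cases}$$

Hence, by setting $x = \sqrt{4 + v}$ and $y = -1$, those values become, respectively, 1 and 0, i.e.,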
the resulting truth-telling scores do not depend on parameters of the model. Moreover, we obtain a very 
intuitive interpretation of such scores: whenever two reported evaluation scores are equal to each other, 
then the underlying reviewers are rewarded by one payoff unit. Thus, in practice, our scoring method 
works by comparing reviews and rewarding agreement. 

From the reviewers' perspective, it is clear that consensus is the best possible outcome. It is
worth mentioning that such an outcome cannot come about through collusion because our assumption
about autonomy automatically guarantees that our method is collusion-free, i.e., a priori agreements
on a single, common review, which would maximize reviewers' truth-telling scores, are not possible.

Another interesting point to note is that we can reward different agreements in distinct ways. For 
example, if the decision-maker knows a priori that a particular reviewer j is trustworthy (respectively, 
untrustworthy), then he can increase (respectively, decrease) the reward of reviewers whose reviews are 
in agreement with reviewer j's review. Formally, that means that we can use different values for $x$ and
$y$ for different reviewers $i$ and $z$ in Equation 6. Proposition 1 is not affected by that as long as $x > 0$,
$y \in \mathbb{R}$, and these values do not depend on the reported reviews.



3.3 Numerical Example 

Proceeding with the previous example, suppose that reviewers 1 and 2 truthfully disclose their
reviews, i.e., $r_1 = t_1 = (0, 1, 3)$ and $r_2 = t_2 = (0, 2, 3)$. However, reviewer 3 misreports
his review by reporting $r_3 = (4, 4, 4)$, thus overestimating the quality of the manuscript. The
resulting estimated posterior predictive distributions are shown in Table 2.






Next, we compute reviewers' truth-telling scores (Equation 6). Recall that $v = 4$. Thus, let
$x = \sqrt{4 + 4}$, $y = -1$, and $R$ be the spherical scoring rule (Equation 4). We then obtain the
following truth-telling scores: $s_1 = 2$, $s_2 = 2$, and $s_3 = 0$. In detail, the scores of
reviewers 1 and 2 are due to the fact that $r_{1,1} = r_{2,1}$ and $r_{1,3} = r_{2,3}$. Reviewer 3's
score is equal to 0 because there are no matches between his reported review and others' reported
reviews.
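The truth-telling scores of this example can be reproduced with a short Python sketch of Equation 6 using the spherical rule. The function names are hypothetical, and the constants are fixed to the normalization used in the text, x = sqrt(4 + v) and y = -1.

```python
import math


def estimated_predictive(reported_score, v):
    """Equation 5: 2/(v+2) on the reported score, 1/(v+2) on every other score."""
    return [(2 if k == reported_score else 1) / (v + 2) for k in range(v + 1)]


def spherical(q, e):
    return q[e] / math.sqrt(sum(p * p for p in q))


def truth_telling_scores(reviews, v):
    """Equation 6 with x = sqrt(4 + v) and y = -1 (one unit per matching score)."""
    x, y = math.sqrt(4 + v), -1.0
    n, c = len(reviews), len(reviews[0])
    scores = []
    for i in range(n):
        total = 0.0
        for j in range(c):
            phi = estimated_predictive(reviews[i][j], v)
            for z in range(n):
                if z != i:  # only peers' reviews act as outcomes
                    total += x * spherical(phi, reviews[z][j]) + y
        scores.append(total)
    return scores


reported = [(0, 1, 3), (0, 2, 3), (4, 4, 4)]
scores = truth_telling_scores(reported, v=4)
print(scores)
```

Up to floating-point rounding, the three printed values should be 2, 2, and 0, i.e., each score equals the number of evaluation scores a reviewer shares with his peers.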

Table 2: Numerical example: estimated posterior predictive distributions.

                j = 1                        j = 2                        j = 3
$\Phi_{1,j}$    (2/6, 1/6, 1/6, 1/6, 1/6)    (1/6, 2/6, 1/6, 1/6, 1/6)    (1/6, 1/6, 1/6, 2/6, 1/6)
$\Phi_{2,j}$    (2/6, 1/6, 1/6, 1/6, 1/6)    (1/6, 1/6, 2/6, 1/6, 1/6)    (1/6, 1/6, 1/6, 2/6, 1/6)
$\Phi_{3,j}$    (1/6, 1/6, 1/6, 1/6, 2/6)    (1/6, 1/6, 1/6, 1/6, 2/6)    (1/6, 1/6, 1/6, 1/6, 2/6)




4 Consensus 

In order to assist the decision-maker to assess the quality of the manuscript, we propose an adaptation of 
a classical approach to find a review that represents consensus among reviewers. We relax our model by 
allowing the aggregate evaluation scores to take any real value between 0 and $v$.

One of the most natural ways to aggregate evaluation scores is by taking the arithmetic mean of 
them. However, this approach does not take into account the scoring concepts introduced in the previous 
section, a fact which may favour untrustworthy reviewers. 

In our approach, every reviewer $i \in N$ weights the reviews of all reviewers (including himself).
We denote these weights by $\mathbf{w}_i = (w_{i,1}, \ldots, w_{i,n})$, where
$\sum_{z=1}^{n} w_{i,z} = 1$ and $0 < w_{i,z} < 1$, for every $z \in N$. $w_{i,z}$ can be seen as
the weight assigned by reviewer $i$ to reviewer $z$'s review.

We note that several problems may arise if we elicit those weights directly, e.g., it would bring back 
the previously addressed issue of honest reporting. Hence, we derive weights from the original reviews 
by using a similar idea as in the computation of the truth-telling scores. In detail, the weight that a 
reviewer assigns to a peer's review is directly related to the number of matches between their evaluation 
scores: the higher the number of matches is, the greater that weight will be. The rationale behind this 
approach is that if reviewers care about their expected truth-telling scores, then they should prefer reviews 
that are closer to their own reviews, where closeness is measured by the number of similar evaluation 
scores. Formally, we have that: 

$$w_{i,z} = \frac{\sum_{j \in \{1, \ldots, c\}} \left[ x R(\Phi_{i,j}, r_{z,j}) + y \right] + \frac{\epsilon}{n}}
{\sum_{t \in N} \sum_{j \in \{1, \ldots, c\}} \left[ x R(\Phi_{i,j}, r_{t,j}) + y \right] + \epsilon}. \qquad (7)$$

In the numerator of the above fraction, we compute "how close" the reviews of reviewer $i$ and
reviewer $z$ are. In order to avoid consensus problems when there are no matches, we use a small
recalibration factor $\epsilon \in (0, 1]$ to adjust weights away from 0/1 extreme values. We
compute the total number of matches between reviewer $i$'s review and all reviewers' reviews in the
denominator of Equation 7. Hence, it ensures that each weight lies between 0 and 1.

It is interesting to note that the weight that each reviewer assigns to himself is always the highest 
one. Furthermore, our method preserves privacy, i.e., reviewers' identities and reviews do not need to be 
disclosed to others. Finally, it is easy to see that $\sum_{z=1}^{n} w_{i,z} = 1$ and
$0 < w_{i,z} < 1$, for every $i, z \in N$, because of the recalibration factor $\epsilon$.
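A minimal Python sketch of the weight computation follows. It rests on two assumptions that should be read as such: the constants are normalized as in Section 3.2 (x = sqrt(4 + v), y = -1), so that each term of the sums reduces to a 0/1 match indicator, and the recalibration term in the numerator is taken to be epsilon / n, which makes every row sum to one. The function names are hypothetical.

```python
def count_matches(review_a, review_b):
    """Number of criteria on which two reviews give the same evaluation score."""
    return sum(1 for a, b in zip(review_a, review_b) if a == b)


def weight_matrix(reviews, epsilon=0.01):
    """Row i holds the weights reviewer i assigns to every reviewer (self included)."""
    n = len(reviews)
    weights = []
    for i in range(n):
        matches = [count_matches(reviews[i], reviews[z]) for z in range(n)]
        denominator = sum(matches) + epsilon
        # Assumed recalibration: epsilon is split evenly across the n numerators.
        weights.append([(m + epsilon / n) / denominator for m in matches])
    return weights


w = weight_matrix([(0, 1, 3), (0, 2, 3), (4, 4, 4)], epsilon=0.01)
for row in w:
    print([round(value, 4) for value in row])
```

Because a review always matches itself on all c criteria, the self-weight is the largest entry of each row, and the strictly positive epsilon keeps every weight strictly between 0 and 1.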






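In the interest of achieving consensus, we make the assumption that if each reviewer $i \in N$
could learn others' reported reviews, then he would be willing to revise his own reported review,
$r_i = (r_{i,1}, \ldots, r_{i,c})$, in order to accommodate the expertise of others. Formally, his
revised review would be $r_i^{(1)} = (r_{i,1}^{(1)}, \ldots, r_{i,c}^{(1)})$, where
$r_{i,j}^{(1)} = \sum_{z \in N} w_{i,z} r_{z,j}$, for every $j \in \{1, \ldots, c\}$. In other
words, each revised review is a linear combination of all reported reviews. The whole updating
process can be written in a slightly more general form using matrix notation:
$\mathbf{r}^{(1)} = \mathbf{w}\mathbf{r}$, where:

$$\mathbf{w} = \begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n,1} & w_{n,2} & \cdots & w_{n,n}
\end{bmatrix}
\quad \text{and} \quad
\mathbf{r} = \begin{bmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,c} \\
r_{2,1} & r_{2,2} & \cdots & r_{2,c} \\
\vdots & \vdots & \ddots & \vdots \\
r_{n,1} & r_{n,2} & \cdots & r_{n,c}
\end{bmatrix}.$$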



In order to arrive at consensus, reviewers would keep revising their reviews until
$\mathbf{r}^{(x+1)} = \mathbf{r}^{(x)} = \mathbf{w}^{x}\mathbf{r}$, for some value of $x$, i.e.,
when further revisions do not actually change any reviewer's review. This method was originally
proposed by DeGroot [6]. Using well-known properties of Markov chains (see Appendix), it can be
shown that when $x \to \infty$, $r_i^{(x)} = r_z^{(x)}$, for every reviewer $i, z \in N$, i.e., by
iteratively revising their own reviews in the above manner, the reviewers all converge towards the
same review.
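The iterative revision can be sketched in Python as repeated multiplication of the current reviews by the weight matrix until the reviews stop changing. The function names, the stopping tolerance, and the small two-reviewer weight matrix below are hypothetical choices made for illustration; they are not taken from the paper.

```python
def mat_mul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def degroot_consensus(weights, reviews, tol=1e-12, max_rounds=100_000):
    """Revise reviews as r <- w r until no entry changes by `tol` or more."""
    current = [list(map(float, row)) for row in reviews]
    for _ in range(max_rounds):
        revised = mat_mul(weights, current)
        change = max(abs(revised[i][j] - current[i][j])
                     for i in range(len(current)) for j in range(len(current[0])))
        current = revised
        if change < tol:
            break
    return current


# Hypothetical two-reviewer, one-criterion illustration with positive weights.
weights = [[0.5, 0.5], [0.25, 0.75]]
reviews = [[0.0], [3.0]]
consensus = degroot_consensus(weights, reviews)
print(consensus)
```

In this toy case the limiting weights are (1/3, 2/3) for both reviewers, so both rows approach the common value 0 · 1/3 + 3 · 2/3 = 2.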

4.1 Numerical Example 

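From the previous example, recall that $r_1 = (0, 1, 3)$, $r_2 = (0, 2, 3)$, and
$r_3 = (4, 4, 4)$. Thus, we have that:

$$\mathbf{r} = \begin{bmatrix} 0 & 1 & 3 \\ 0 & 2 & 3 \\ 4 & 4 & 4 \end{bmatrix}.$$

In Equation 7, let $R$ be the spherical scoring rule (Equation 4). Further, let $x = \sqrt{4 + 4}$,
$y = -1$, and $\epsilon = 0.01$. Thus, we obtain that:

$$\mathbf{w} = \begin{bmatrix}
0.5995 & 0.3999 & 0.0006 \\
0.3999 & 0.5995 & 0.0006 \\
0.0011 & 0.0011 & 0.9978
\end{bmatrix}.$$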



If we focus on the first row of the above matrix, we will see that reviewer 1 assigns a high weight
to himself and to reviewer 2, and a low weight to reviewer 3. That happens because the reviews
reported by reviewer 1 and reviewer 3 are very dissimilar. We can draw similar conclusions from the
other rows. If we apply DeGroot's method in order to achieve consensus, we obtain the following
weights:

$$\mathbf{w}^{\infty} = \begin{bmatrix}
0.3845 & 0.3845 & 0.2310 \\
0.3845 & 0.3845 & 0.2310 \\
0.3845 & 0.3845 & 0.2310
\end{bmatrix}$$






and, consequently, the consensual review is represented by any row of the matrix
$\mathbf{w}^{\infty}\mathbf{r}$ because they are all the same: $(0.924, 2.078, 3.231)$. It is
worthwhile to discuss two interesting points in this example. First, we note that if consensus
were based on the arithmetic mean of reviews, the consensual review
would be (1.333, 2.333, 3.333). Hence, reviewer 3 would have more impact on the aggregate review. In 
our approach, his influence is diluted because his review is very dissimilar to others' reviews or, putting 
it in another way, his truth-telling score is low. 

Lastly, we observe that the exact value of $\epsilon$ does not matter much, as long as it is
sufficiently small. For example, for any $\epsilon$ greater than 0 and less than 0.01, the
consensual review will be approximately the same in the above example. Thus, small differences in
the value of $\epsilon$ have a negligible impact on the consensual review.
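The numbers in this example, including the limited sensitivity to the recalibration factor, can be checked with the end-to-end Python sketch below. It makes the same assumptions as before, which are not confirmed by the text itself: match-based weights with the recalibration term split as epsilon / n in the numerator, and plain repeated revision until the reviews stop changing. The function names, tolerance, and round limit are hypothetical.

```python
def weight_matrix(reviews, epsilon):
    """Match-based weights; epsilon / n in each numerator is an assumption."""
    n = len(reviews)
    rows = []
    for mine in reviews:
        matches = [sum(a == b for a, b in zip(mine, other)) for other in reviews]
        rows.append([(m + epsilon / n) / (sum(matches) + epsilon) for m in matches])
    return rows


def consensual_review(reviews, epsilon, tol=1e-12, max_rounds=500_000):
    """Repeat r <- w r until no entry changes by `tol` or more; return one row."""
    w = weight_matrix(reviews, epsilon)
    current = [list(map(float, r)) for r in reviews]
    for _ in range(max_rounds):
        revised = [[sum(w[i][z] * current[z][j] for z in range(len(w)))
                    for j in range(len(current[0]))] for i in range(len(w))]
        done = all(abs(a - b) < tol
                   for row_a, row_b in zip(revised, current)
                   for a, b in zip(row_a, row_b))
        current = revised
        if done:
            break
    return current[0]


reported = [(0, 1, 3), (0, 2, 3), (4, 4, 4)]
consensus_001 = consensual_review(reported, epsilon=0.01)
consensus_0001 = consensual_review(reported, epsilon=0.001)
mean_review = [sum(column) / len(reported) for column in zip(*reported)]

print([round(value, 3) for value in consensus_001])
print([round(value, 3) for value in consensus_0001])
print([round(value, 3) for value in mean_review])
```

Under these assumptions, both values of epsilon give a consensual review close to the one reported in the example, while the arithmetic mean is pulled further towards reviewer 3's review.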

5 Discussion 

We proposed an enhanced peer-review process that encourages honest reporting. In detail, we modelled 
that process using a Bayesian model where the uncertainty regarding the quality of the manuscript is taken 
into account. The main assumptions in our model are that reviewers do not have meaningful information 
about the quality of the manuscript before reviewing it, they cannot be influenced by other reviewers, 
and they are Bayesian decision-makers. The first two assumptions are in agreement with traditional 
practices [11]. The last one, although not necessarily realistic, is common in the related literature, e.g.,
game theory [21] and multiagent systems [27].

We also proposed a scoring method that evaluates reviews in order to promote truthfulness. We argue 
that it is feasible and practical since it works by comparing the reported reviews and rewarding agreement. 
Under the aforementioned assumptions, we showed that reviewers strictly maximize their expected scores 
by truthfully disclosing their reviews. We are currently testing the efficacy of our approach using real- 
world data. 

Finally, we proposed an adaptation of DeGroot's method for finding consensus. Intuitively, it works 
as if the reviewers were going through several rounds of discussion, where in each of them they are 
informed about others' reviews, and they update their own reviews in order to reach consensus. At 
the end of the process, the decision-maker has a value for each criterion that represents the consensual 
evaluation score. 

In practice, our method preserves privacy, i.e., both reviewers' reviews and their identities do not 
need to be disclosed to others. Our approach is rather general in that it can be applied to scenarios other 
than the peer-review process, e.g., forecasting. A generalization of this method considering different 
closeness metrics and also dynamic updating of weights is the subject of ongoing research. 

In the following subsections, we propose an extension of our model where reviewers may observe 
several realizations of each random variable before reviewing the manuscript. We also discuss
different perspectives on bias and how they differ from the one adopted in this work. Finally, we talk about
practical issues related to the application of our method to find consensus in the peer-review process. 

5.1 Multiple Realizations 

Very often in the peer-review process, each evaluation criterion is broken down into many subcriteria. 
For example, a criterion related to grammar and style may be broken down into a subcriterion concerning 
grammatical and spelling problems and a subcriterion regarding the clarity of the author's writing style. 
Another common example is the existence of only one major criterion (e.g., overall score) which is 
divided into a number of subcriteria (e.g., relevance, organization, soundness, etc.). 






We can capture those settings in our model by allowing each reviewer to observe multiple realizations 
of each random variable, where each realization can be seen as an evaluation score given a particular 
subcriterion. Formally, let $\rho_{i,j} \in \mathbb{Z}^{+}$ be the number of realizations observed by reviewer $i$ from the 
random variable representing criterion $j$, $X_j$. Instead of a single number, $t_{i,j}$ is now a vector, i.e., 
$t_{i,j} = (t_{i,j,1}, \ldots, t_{i,j,\rho_{i,j}})$, where $t_{i,j,k} \in \{0, \ldots, v\}$, for $k \in \{1, \ldots, \rho_{i,j}\}$. The basic assumptions (autonomy, 
non-informative Dirichlet priors, and rationality) are still the same. Since we have multiple realizations 
for each reported evaluation score, we need to redefine our truthfulness concept. Given a criterion $j$, we 
now say that reviewer $i$ is truthfully reporting his review when: 

$$r_{i,j} = \operatorname*{argmax}_{x \in \{0, \ldots, v\}} \sum_{k=1}^{\rho_{i,j}} H(x, t_{i,j,k}),$$

where $H(x, y)$ is the indicator function in Equation 1. In words, reviewer $i$ reports the most common 
realization coming from $X_j$. Ties are broken randomly. In this new model, reviewer $i$'s posterior predictive 
distribution given $t_{i,j}$, $\Theta_{i,j} = (\theta_{i,j,0}, \ldots, \theta_{i,j,v})$ (Equation 2), is now defined as: 

$$\Theta_{i,j} = \left( \frac{\alpha_0 + \sum_{k=1}^{\rho_{i,j}} H(0, t_{i,j,k})}{\sum_{x=0}^{v} \alpha_x + \rho_{i,j}}, \ldots, \frac{\alpha_v + \sum_{k=1}^{\rho_{i,j}} H(v, t_{i,j,k})}{\sum_{x=0}^{v} \alpha_x + \rho_{i,j}} \right) \qquad (8)$$

where $\alpha_x = 1$, for $x \in \{0, \ldots, v\}$, due to the assumption of non-informative priors. Interestingly, we can 
apply the same method proposed in Section 3.2 (Equation 6) to compute truth-telling scores under this 
new model and our major result regarding truthfulness is still valid, namely, reviewers strictly maximize 
their expected truth-telling scores by telling the truth. We show in the following proposition that this 
result holds even if the decision-maker does not know a priori the number of observed realizations. 

Proposition 2. When observing many realizations, each reviewer $i \in N$ strictly maximizes his expected 
truth-telling score when, $\forall j \in \{1, \ldots, c\}$, $r_{i,j} = \operatorname*{argmax}_{x \in \{0, \ldots, v\}} \sum_{k=1}^{\rho_{i,j}} H(x, t_{i,j,k})$. 

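Proof. Since the random variables representing criteria are independent of each other and reviewers 
cannot affect their peers' reviews, we can restrict ourselves to show that reviewer $i$ strictly maximizes 
$\mathbb{E}_{\Theta_{i,j}}\left[\chi R(\cdot, r_{z,j}) + \gamma\right]$ when $r_{i,j} = \operatorname*{argmax}_{x \in \{0, \ldots, v\}} \sum_{k=1}^{\rho_{i,j}} H(x, t_{i,j,k})$, for $z \neq i$ and $\Theta_{i,j}$ as defined in 
Equation 8. Without loss of generality, let 

$$\operatorname*{argmax}_{x \in \{0, \ldots, v\}} \sum_{k=1}^{\rho_{i,j}} H(x, t_{i,j,k}) = 0,$$

i.e., "0" is the most common realization coming from $X_j$. Consequently, $\theta_{i,j,0} > \theta_{i,j,x}$, for $x \in 
\{1, \ldots, v\}$. In this proof, consider the following notation. Let $\Psi_{i,j} = (\psi_{i,j,0}, \ldots, \psi_{i,j,v})$ be reviewer $i$'s 
estimated posterior predictive distribution when he is telling the truth, i.e., 

$$\psi_{i,j,k} = \begin{cases} \frac{2}{v+2} & \text{if } k = 0, \\ \frac{1}{v+2} & \text{otherwise.} \end{cases}$$

For contradiction's sake, suppose that reviewer $i$ maximizes $\mathbb{E}_{\Theta_{i,j}}\left[\chi R(\cdot, r_{z,j}) + \gamma\right]$ by misreporting 
his review. Without loss of generality, assume that he reports $r_{i,j} = 1$. Since "1" is not the most 
common realization observed by reviewer $i$ from $X_j$, then $\theta_{i,j,0} > \theta_{i,j,1}$. Let $\Phi_{i,j} = (\phi_{i,j,0}, \ldots, \phi_{i,j,v})$ 
be reviewer $i$'s estimated posterior predictive distribution when he is misreporting his review, i.e., 

$$\phi_{i,j,k} = \begin{cases} \frac{2}{v+2} & \text{if } k = 1, \\ \frac{1}{v+2} & \text{otherwise.} \end{cases}$$

It is important to note that $\psi_{i,j,k} = \phi_{i,j,k}$ for $k \in \{2, \ldots, v\}$. A consequence of our assumption that 
reviewer $i$ maximizes his expected score by misreporting his review is that: 

$$\mathbb{E}_{\Theta_{i,j}}\left[\chi R(\Phi_{i,j}, r_{z,j}) + \gamma\right] > \mathbb{E}_{\Theta_{i,j}}\left[\chi R(\Psi_{i,j}, r_{z,j}) + \gamma\right].$$

Let $R$ be the spherical scoring rule. Then, the above inequality becomes: 

$$\begin{aligned}
\sum_{k=0}^{v} \theta_{i,j,k} \frac{\phi_{i,j,k}}{\sqrt{\sum_{x=0}^{v} \phi_{i,j,x}^{2}}} &> \sum_{k=0}^{v} \theta_{i,j,k} \frac{\psi_{i,j,k}}{\sqrt{\sum_{x=0}^{v} \psi_{i,j,x}^{2}}} \\
\sum_{k=0}^{v} \theta_{i,j,k}\, \phi_{i,j,k} &> \sum_{k=0}^{v} \theta_{i,j,k}\, \psi_{i,j,k} \\
\theta_{i,j,0}\, \phi_{i,j,0} + \theta_{i,j,1}\, \phi_{i,j,1} &> \theta_{i,j,0}\, \psi_{i,j,0} + \theta_{i,j,1}\, \psi_{i,j,1} \\
\theta_{i,j,0} + 2\,\theta_{i,j,1} &> 2\,\theta_{i,j,0} + \theta_{i,j,1}
\end{aligned}$$

The second and third lines follow from the facts that $\sum_{x=0}^{v} \phi_{i,j,x}^{2} = \sum_{x=0}^{v} \psi_{i,j,x}^{2}$ and that $\phi_{i,j,k} = \psi_{i,j,k}$ 
for $k \in \{2, \ldots, v\}$. Regarding the last line, we have by construction that $\psi_{i,j,0} = \phi_{i,j,1} = \frac{2}{v+2}$ and 
$\psi_{i,j,1} = \phi_{i,j,0} = \frac{1}{v+2}$. Consequently, we obtain that $\theta_{i,j,1} > \theta_{i,j,0}$. As we stated 
before, since "0" is the most common realization observed by reviewer $i$ from $X_j$, then $\theta_{i,j,0} > \theta_{i,j,1}$. 
Thus, we have a contradiction. So, $\mathbb{E}_{\Theta_{i,j}}\left[\chi R(\Phi_{i,j}, r_{z,j}) + \gamma\right] < \mathbb{E}_{\Theta_{i,j}}\left[\chi R(\Psi_{i,j}, r_{z,j}) + \gamma\right]$, i.e., 
reviewer $i$ strictly maximizes his expected truth-telling score by telling the truth. 
□ 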
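
The argument in the proof can also be checked numerically. The sketch below is a minimal illustration that assumes, as read from the proof, that the distribution estimated from a single reported score puts 2/(v+2) on that score and 1/(v+2) on every other score. The belief vector `theta` is hypothetical, the function names are ours, and the scaling and shift constants of the truth-telling score are left out, on the assumption that the scaling constant is positive and therefore does not change which report is best.

```python
import math


def estimated_distribution(report, v):
    """Distribution inferred from one reported score (assumed form:
    2/(v+2) on the reported score, 1/(v+2) elsewhere)."""
    return [(2 if k == report else 1) / (v + 2) for k in range(v + 1)]


def spherical_score(dist, outcome):
    """Spherical scoring rule: probability of the outcome over the
    Euclidean norm of the whole distribution."""
    return dist[outcome] / math.sqrt(sum(p * p for p in dist))


def expected_score(theta, report):
    """Expected spherical score of a report under the reviewer's beliefs
    theta about a peer's reported score."""
    v = len(theta) - 1
    estimated = estimated_distribution(report, v)
    return sum(theta[k] * spherical_score(estimated, k) for k in range(v + 1))


# Hypothetical beliefs over scores {0, 1, 2, 3}; "0" is the most likely score.
theta = [0.4, 0.3, 0.2, 0.1]
scores = [expected_score(theta, x) for x in range(len(theta))]
best_report = max(range(len(theta)), key=scores.__getitem__)
```

In this example the expected score is highest for the report 0, the score that the reviewer considers most likely, which matches the statement of Proposition 2.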



Numerical Example 

For illustration's sake, consider a review process with one criterion and five subcriteria, i.e., $c = 1$ and, for 
every $i \in N$, $\rho_{i,1} = 5$. Suppose that reviewer $i$ observes the following realizations: $t_{i,1} = (1, 4, 3, 4, 4)$, 
where the top possible evaluation score is equal to four (i.e., $v = 4$). Consequently, reviewer $i$'s posterior 
predictive distribution given $t_{i,1}$ is $\Theta_{i,1} = \left(\frac{1}{10}, \frac{2}{10}, \frac{1}{10}, \frac{2}{10}, \frac{4}{10}\right)$, and reviewer $i$'s truthful review is 4 
since this is the most common realization from $X_1$. 
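
The numbers in this example can be reproduced with a short Python sketch of Equation 8, assuming a uniform prior with every hyperparameter equal to one. The function names are hypothetical, and ties are resolved towards the lowest score only to keep the sketch deterministic, whereas the model breaks ties randomly.

```python
from collections import Counter
from fractions import Fraction


def posterior_predictive(realizations, v):
    """Equation 8 with every Dirichlet hyperparameter equal to one."""
    counts = Counter(realizations)
    # Sum of the v + 1 hyperparameters plus the number of realizations.
    denominator = (v + 1) + len(realizations)
    return [Fraction(1 + counts[x], denominator) for x in range(v + 1)]


def truthful_review(realizations, v):
    """Most common realization; ties go to the lowest score here
    (a simplification, the model breaks ties randomly)."""
    counts = Counter(realizations)
    return max(range(v + 1), key=lambda x: counts[x])


observed = (1, 4, 3, 4, 4)
theta = posterior_predictive(observed, v=4)
review = truthful_review(observed, v=4)
```

Here `theta` holds the exact fractions 1/10, 2/10, 1/10, 2/10 and 4/10 (stored in lowest terms), and `review` is 4.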



5.2 Bias and Truth-Telling Scores 

A consequence of our assumptions about the reviewers' beliefs is that bias becomes a deliberate act, i.e., 
reviewers can actually choose between lying or not. This allowed us to see such a problem as a lack of 
"truth-telling incentives". 






However, we note that this assumption is not commonplace in cognitive science and, more specifically, 
social psychology. In detail, cognitive bias is the general term that is used to describe many 
subconscious procedures in the human mind that lead to perceptual distortion, inaccurate judgement, or 
illogical interpretation [12]. 

Thus, the extent to which truth-telling scores can actually change the behaviour of reviewers 
remains to be seen. Controlled experiments are currently underway to investigate the efficacy of our 
method in real-world scenarios. 

5.3 Consensus and Discussion Phase 

Unlike scientific journals, academic conferences typically have a target number, or at least a 
target range, of manuscripts to accept due to budget, space, and time constraints [7]. Hence, a discussion 
phase is often held soon after the reviewers report their reviews, where the program committee 
collectively rates each paper as being either worthy or unworthy of acceptance based on the aggregate 
judgement of the reviewers. Our method to compute consensus can be seen as a tool to automate such 
a procedure. 

Our method is particularly convenient when there are many reviewers, which makes the consensus-building 
process very expensive and complex, or when consensus was not reached before (for a real-life example, 
see CI). 

We observe that discussion phases may be helpful to correct some adverse effects resulting from our 
scoring approach. For example, consider the case where a single reviewer spots several problems with 
the manuscript, which were missed by the others. Consequently, that reviewer would likely receive a low 
truth-telling score. By having discussion phases, the resulting scores in such scenarios can be corrected. 
A formal extension of our model and scoring method that deals with such settings is the subject of 
ongoing research. 

References 

[1] Working double-blind. Nature, 451(7179):605-606, 2008. 

[2] R. E. Abel and L. W. Newlin. Scholarly publishing: books, journals, publishers, and libraries in 
the twentieth century. John Wiley, 2002. 

[3] J. Bernardo and A. Smith. Bayesian theory. Wiley, 1994. 

[4] A. E. Budden, T. Tregenza, L. W. Aarssen, J. Koricheva, R. Leimu, and C. J. Lortie. Double- 
blind review favours increased representation of female authors. Trends in Ecology & Evolution, 
23(1):4-6, 2008. 

[5] R. M. Cooke. Experts in uncertainty: opinion and subjective probability in science. Oxford Uni- 
versity Press, 1991. 

[6] M. H. DeGroot. Reaching a consensus. Journal of the American Statistical Association, 
69(345):118-121, 1974. 

[7] J. R. Douceur. Paper rating vs. paper ranking. ACM SIGOPS Operating Systems Review, 
43:117-121, 2009. 

[8] W. Edwards. Conservatism in human information processing. In B. Kleinmutz, editor, Formal 
Representation of Human Judgment, pages 17-52. John Wiley and Sons, New York, 1968. 






[9] E. Eisenberg and D. Gale. Consensus of subjective probabilities: The pari-mutuel method. The 
Annals of Mathematical Statistics, 30(1):165-168, 1959. 

[10] M. E. Falagas, G. M. Zouglakis, and P. K. Kavvadia. How masked is the "masked peer review" 
of abstracts submitted to international medical conferences? Mayo Clinic Proceedings, 81(5):705, 
2006. 

[11] E. H. Fredriksson. A Century of Scientific Publishing. IOS Publishing, 2001. 

[12] R. J. Heuer. Psychology of Intelligence Analysis. Center for the Study of Intelligence, 1999. 

[13] S. Johnson, J. W. Pratt, and R. J. Zeckhauser. Efficiency despite mutually payoff-relevant private 
information: The finite case. Econometrica, 58(4):873-900, 1990. 

[14] A. C. Justice, M. K. Cho, M. A. Winker, J. A. Berlin, D. Rennie, and the PEER Investigators. 
Does Masking Author Identity Improve Peer Review Quality?: A Randomized Controlled Trial. 
Journal of the American Medical Association, 280(3):240-242, 1998. 

[15] H. A. Linstone and M. Turoff. The Delphi method: techniques and applications. Addison-Wesley, 
1975. 

[16] S. A. Lock. A Difficult Balance: Editorial Peer Review in Medicine. Nuffield Provincial Hospitals 
Trust, 1991. 

[17] A. J. Meadows. Communicating research. Academic Press, San Diego, 1998. 

[18] N. Miller, P. Resnick, and R. Zeckhauser. Eliciting informative feedback: The peer-prediction 
method. Management Science, 51(9):1359-1373, 2005. 

[19] N. S. Newcombe and M. E. Bouton. Masked reviews are not fairer reviews. Perspectives on 
Psychological Science, 4(1):62-64, 2009. 

[20] T. Norvig. Consensus of subjective probabilities: A convergence theorem. The Annals of 
Mathematical Statistics, 38(1):221-225, 1967. 

[21] M. J. Osborne and A. Rubinstein. A Course in Game Theory. The MIT Press, 1994. 

[22] G. Page, R. Campbell, and A. J. Meadows. Journal Publishing. Cambridge University Press, 2006. 

[23] R. P. Peek and G. B. Newby. Scholarly publishing: the electronic frontier. MIT Press, Cambridge, 
Mass., 1996. 

[24] R. B. Primack and R. Marrs. Bias in the review process. Biological Conservation, 141(12):2919- 
2920, 2008. 

[25] M. Roos, J. Rothe, and B. Scheuermann. How to Calibrate the Scores of Biased Reviewers by 
Quadratic Programming. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 
pages 255-260, 2011. 

[26] L. J. Savage. Elicitation of personal probabilities and expectations. Journal of the American 
Statistical Association, 66(336):783-801, 1971. 

[27] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical 
Foundations. Cambridge University Press, 2009. 






[28] S. van Rooyen. The evaluation of peer-review quality. Learned Publishing, 14(2):85-91, 2001. 

[29] S. van Rooyen, N. Black, and F. Godlee. Development of the review quality instrument (RQI) for 
assessing peer reviews of manuscripts. Journal of Clinical Epidemiology, 52(7):625-629, 1999. 

[30] A. Weller. Editorial peer review: its strengths and weaknesses. American Society for Information 
Science & Technology, 2001. 

[31] R. L. Winkler. Scoring rules and the evaluation of probability assessors. Journal of the American 
Statistical Association, 64(327):1073-1078, 1969. 

[32] R. L. Winkler and A. H. Murphy. Good probability assessors. Journal of Applied Meteorology, 
7(5):751-758, 1968. 

[33] A. Yankauer. How blind is blind review? American Journal of Public Health, 81(7):843-845, 
1991. 






